Xception: Deep Learning with Depthwise Separable Convolutions

Xception is Google's follow-up improvement on Inception V3: it replaces the convolution operations of Inception V3 with depthwise separable convolutions.
Paper link: https://arxiv.org/abs/1610.02357

Introduction

Xception is an evolution of Inception; the figure below shows an Inception V3 module:

Figure 1. A canonical Inception module (Inception V3)

The authors frame a convolution as learning two kinds of correlations:

  • cross-channel correlations: convolve the input with N kernels of size 1 x 1 x input_channels to obtain N feature maps; this step captures the correlations between different channels.
  • spatial correlations: then convolve each of the N feature maps separately with its own k x k kernel; this step captures spatial correlations.

Together, these two steps split a conventional convolution into two stages, completely separating the task of learning cross-channel correlations from the task of learning spatial correlations.
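To get a feel for what this factorization saves, here is a back-of-the-envelope weight count (not from the paper; the channel sizes M, N and kernel size k are made up for illustration, and biases are ignored):

```python
# Rough per-layer weight count for a standard convolution vs. the factorized version.
# M, N, k are arbitrary example sizes, chosen only to illustrate the factorization.
M, N, k = 256, 256, 3  # input channels, output channels, spatial kernel size

standard_conv = k * k * M * N       # one k x k x M kernel per output channel
cross_channel = 1 * 1 * M * N       # N pointwise (1 x 1 x M) kernels
spatial = k * k * N                 # one k x k kernel per resulting feature map
factorized = cross_channel + spatial

print(f"standard {k}x{k} conv : {standard_conv:,} weights")  # 589,824
print(f"factorized version  : {factorized:,} weights")       # 67,840
```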

The authors make the following hypothesis:

assume that cross-channel correlations and spatial correlations can be mapped completely separately

Based on this hypothesis, the authors propose the following structure:

Figure 2. An “extreme” version of our Inception module, with one spatial convolution per output channel of the 1x1 convolution.

The structure in Figure 2 first convolves the input with N kernels of size 1 x 1 x input_channels to obtain N feature maps (cross-channel correlations); it then convolves each of those feature maps separately with its own k x k kernel (spatial correlations).
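A minimal Keras-style sketch of this "extreme" module is shown below. The figure draws N separate 3x3 convolutions followed by a concatenation; here a `DepthwiseConv2D` with depth multiplier 1 stands in for that per-channel stage, which computes the same thing. The helper name `extreme_inception_block` is mine, not the paper's.

```python
from tensorflow.keras import layers

def extreme_inception_block(x, n_filters, kernel_size=3):
    """Sketch of the 'extreme' Inception module of Figure 2 (illustrative only).

    A 1x1 convolution first mixes channels (cross-channel correlations);
    a depthwise convolution then applies one kernel_size x kernel_size filter
    per output channel of the 1x1 convolution (spatial correlations).
    """
    x = layers.Conv2D(n_filters, 1, padding="same", activation="relu")(x)
    x = layers.DepthwiseConv2D(kernel_size, padding="same", activation="relu")(x)
    return x
```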

This structure is similar to a depthwise separable convolution. The figure below illustrates a depthwise separable convolution, which simply splits the conventional convolution into two steps. Suppose the original convolution is 3x3: a depthwise separable convolution first convolves each of the M input channels with its own 3x3x1 kernel, producing M feature maps; it then convolves those M feature maps with N 1x1xM kernels, producing N feature maps. The paper accordingly breaks a depthwise separable convolution into two steps: the depthwise convolution, shown in (b) below, and the pointwise convolution, shown in (c).
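Written as Keras layers, the two steps look roughly like this (a sketch that assumes `padding="same"` and no biases; `depthwise_separable_conv` is just an illustrative name):

```python
from tensorflow.keras import layers

def depthwise_separable_conv(x, n_filters, kernel_size=3):
    """Depthwise separable convolution written out as its two steps (a sketch)."""
    # (b) depthwise convolution: one kernel_size x kernel_size x 1 kernel per input channel
    x = layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)(x)
    # (c) pointwise convolution: n_filters 1x1xM kernels mixing the M depthwise outputs
    x = layers.Conv2D(n_filters, 1, padding="same", use_bias=False)(x)
    return x

# Keras also ships the fused version as a single layer:
# layers.SeparableConv2D(n_filters, kernel_size, padding="same")
```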

Figure 2 differs from a depthwise separable convolution in two ways:

  1. The order differs: a depthwise separable convolution first performs the channel-wise spatial convolution, i.e. (b) above, and then the 1x1 convolution, whereas the module in Figure 2 performs the 1x1 convolution first, then the channel-wise spatial convolution, and finally concatenates the results.
  2. The presence of non-linearities: in Figure 2 every convolution is followed by a ReLU non-linearity, whereas the depthwise separable convolution has none in between.

The Xception architecture

The Xception architecture: the data first goes through the entry flow, then through the middle flow which is repeated eight times, and finally through the exit flow. Note that all Convolution and SeparableConvolution layers are followed by batch normalization [7] (not included in the diagram). All SeparableConvolution layers use a depth multiplier of 1 (no depth expansion).

_The SeparableConvolution here is the depthwise separable convolution discussed above._

In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections.
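As a concrete illustration, here is a hedged sketch of one middle-flow block as described in the diagram caption above (three separable convolutions of 728 filters with batch normalization and an identity residual connection; the exact ReLU placement follows my reading of the paper's diagram):

```python
from tensorflow.keras import layers

def middle_flow_block(x, filters=728):
    """One Xception middle-flow block (a sketch based on the paper's diagram).

    Three ReLU -> SeparableConv2D -> BatchNorm stages with an identity
    residual connection around the whole block; the middle flow repeats
    this block eight times.
    """
    residual = x
    for _ in range(3):
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
    return layers.Add()([x, residual])
```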

_code link : https://keras.io/applications/#xception_
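For reference, the pre-built model can be loaded through `keras.applications` (a minimal usage example; `weights` and `include_top` are part of the public Keras API):

```python
# Load the reference Xception implementation with ImageNet weights.
from tensorflow.keras.applications import Xception

model = Xception(weights="imagenet", include_top=True)
model.summary()
```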

Experiment

The authors compare Xception against Inception V3 because the two networks have roughly the same number of parameters, so any performance difference can be attributed to the architectures themselves. Two tasks are used:

  • 1000-class single-label classification task on the ImageNet dataset
  • 17,000-class multi-label classification task on the large-scale JFT dataset.

For the detailed training setup, see the original paper: https://arxiv.org/abs/1610.02357

Classification performance

A selection of the results:

Table 1. Classification performance comparison on ImageNet (single crop, single model). VGG-16 and ResNet-152 numbers are only included as a reminder. The version of Inception V3 being benchmarked does not include the auxiliary tower.

Table 2. Classification performance comparison on JFT (single crop, single model).

The Xception architecture shows a much larger performance improvement on the JFT dataset compared to the ImageNet dataset. We believe this may be due to the fact that Inception V3 was developed with a focus on ImageNet and may thus be by design over-fit to this specific task. On the other hand, neither architecture was tuned for JFT. It is likely that a search for better hyperparameters for Xception on ImageNet (in particular optimization parameters and regularization parameters) would yield significant additional improvement.

Size and speed

Table 3. Size and training speed comparison.

Conclusions

We showed how convolutions and depthwise separable convolutions lie at both extremes of a discrete spectrum, with Inception modules being an intermediate point in between. This observation has led us to propose replacing Inception modules with depthwise separable convolutions in neural computer vision architectures. We presented a novel architecture based on this idea, named Xception, which has a similar parameter count as Inception V3. Compared to Inception V3, Xception shows small gains in classification performance on the ImageNet dataset and large gains on the JFT dataset. We expect depthwise separable convolutions to become a cornerstone of convolutional neural network architecture design in the future, since they offer similar properties as Inception modules, yet are as easy to use as regular convolution layers.
